  #915  
Old 25.09.2019, 19:58
Demongornot
JD Beta
 
Join Date: Sep 2019
Location: Universe, Local group, Milky Way, Solar System, Earth, France
Posts: 50

Well, I tried "queryLinkCrawlerJobs" using this:
Code:
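// Only react when the link crawler reports that it has started
var eventX = event;
if (eventX.publisher == 'linkcrawler' && eventX.id == 'STARTED') {
    // The event payload is a JSON string carrying the job and crawler IDs
    var dt = JSON.parse(eventX.data);
    var jid = dt.jobId;
    var cid = dt.crawlerId; // not used yet
    // Query object for the job that just started
    var lscq = {
        "collectorInfo": true,
        "jobId": jid
    };
    alert(callAPI("linkgrabberv2", "queryLinkCrawlerJobs", lscq));
}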
But it only returns "[]". Also, if I understand the terms correctly, a "Crawler ID" identifies one crawler searching for links from a single URL, while a "Job ID" identifies the whole process of looking for links from all the URLs that have been put into JD, which wraps as many crawlers as there are URLs (which would be why there can be multiple crawler IDs?).
Or did I get it wrong? Between "Job", "Crawler Job" (I guess those two are the same, but you never know) and "Crawler", I am not sure what is actually what...
Anyway, I do get a jobID, and I would like to retrieve the links from it.

The issue is that it looks like a complicated mess: "queryLinkCrawlerJobs" returns nothing, and even if it did, I would get a "List<JobLinkCrawler>", but "JobLinkCrawler" isn't used by any API method. I also need to set "setAssignJobID" to "true", but the only place I can find it is in "AddLinksQuery", which isn't returned by any API method either, and I can't find any "queryLinksParameter"...

It looks like getting from a jobID to the list of links it added would require a really messy series of API calls...
If only I could get an "added time" for crawled links, it would simplify things: I would simply look for the most recently added one and find the others that came from the same job using ".getSourceJob". But I don't know how to get a jobID from that, and ".getSourceJob" returns "null". Still, with the added date itself I could at least tell which links are older than the job...
Also, if I knew which of ".getContainerURL", ".getContentURL", ".getOriginURL", ".getReferrerURL" and ".getURL" is actually the original URL the crawler used to find the links, I could simply compare all the links sharing the same xxxURL.
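To show what I mean by the "added time" idea, here is a minimal sketch on plain objects. The "addedAt" field, the "linksAddedSince" helper and the sample data are all made up for illustration; they are assumptions and not something the JD API is known to expose:

```javascript
// Hypothetical helper: keep only the links added at or after a given time.
// "addedAt" (milliseconds) is an ASSUMED field, not a confirmed JD property.
function linksAddedSince(links, sinceMs) {
    var result = [];
    for (var i = 0; i < links.length; i++) {
        if (links[i].addedAt >= sinceMs) {
            result.push(links[i]);
        }
    }
    return result;
}

// Made-up sample data standing in for the crawled link list
var sampleLinks = [
    { name: "old.zip", addedAt: 1000 },
    { name: "new1.zip", addedAt: 5000 },
    { name: "new2.zip", addedAt: 5200 }
];

// Pretend the job started at t = 4000; anything added later belongs to it
var jobStartedAt = 4000;
var fresh = linksAddedSince(sampleLinks, jobStartedAt);
```

With a real added time per link, the start time of the job would be the only other value needed to separate new links from old ones.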
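The URL comparison could be sketched like this, again on plain objects. The "originUrl" field name, the "groupByUrlField" helper and the sample links are hypothetical, since I still don't know which of the real getters holds the URL the crawler started from:

```javascript
// Hypothetical helper: bucket links by the value of one URL field.
// Links that lack the field are skipped.
function groupByUrlField(links, fieldName) {
    var groups = {};
    for (var i = 0; i < links.length; i++) {
        var key = links[i][fieldName];
        if (key === undefined || key === null) {
            continue;
        }
        if (!Object.prototype.hasOwnProperty.call(groups, key)) {
            groups[key] = [];
        }
        groups[key].push(links[i]);
    }
    return groups;
}

// Made-up sample data; "originUrl" is an ASSUMED field name
var crawled = [
    { name: "a.zip", originUrl: "https://example.com/page1" },
    { name: "b.zip", originUrl: "https://example.com/page2" },
    { name: "c.zip", originUrl: "https://example.com/page1" },
    { name: "d.zip" }
];

var byOrigin = groupByUrlField(crawled, "originUrl");
var fromPage1 = byOrigin["https://example.com/page1"];
```

The same helper could be pointed at each of the five URL fields in turn to see which one actually groups the links the way the crawler found them.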

I'm out of ideas on how to get only the latest added links, even by going through the whole crawled link list.