#1
Hi,
I'm trying to crawl all pictures and videos from a few Tumblr blogs I'm following, like: **External links are only visible to Support Staff** **External links are only visible to Support Staff**
I noticed that the file size of the resulting folder isn't constant, even though nothing on the blog seems to have been added or removed. For example: JDownloader shows me there are 10706 pictures online. A download results in a folder with 10706 pictures. If I do that again, I get another folder with 10706 pictures, but the content isn't exactly the same. It's kind of random: several pictures are present in the first crawl but not in the second, and vice versa. I would assume that I should get the same result every time.
If I use the "Check online status" function, JDownloader adds extra files instead of just verifying the online state of the files already present.
#2
@yk1649: "Check online status" cannot add more files, as there is absolutely no support/function for this. All it does is check the selected file(s). More likely, crawling is still going on (see the icon in the bottom right corner) and JDownloader is still crawling/adding files.
It can also be that the pagination of the site is unstable and does not return the same items on a new loop/walk-through. You can test this by adding the links, waiting until crawling is complete, then adding them again and checking whether the number of items in the LinkGrabber increases.
__________________
JD-Dev & Server-Admin
Last edited by Jiaz; 12.04.2024 at 11:29.
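A toy simulation can illustrate how unstable pagination would produce differing crawl results. This is only a sketch under a loud assumption: it models plain offset-based pagination over a newest-first feed, which the thread does not confirm is how Tumblr pages its results. The names `crawl`, `make_feed`, `insert_new_post`, and `delete_newest_post` are hypothetical.

```python
def crawl(feed, page_size, after_page=None):
    """Walk `feed` with offset-based pagination and return every item seen."""
    seen = []
    offset = 0
    while offset < len(feed):
        seen.extend(feed[offset:offset + page_size])
        offset += page_size
        if after_page is not None:
            after_page(feed, offset)  # lets the feed change mid-crawl
    return seen


def make_feed():
    # Ten hypothetical posts, newest first: p9, p8, ..., p0
    return [f"p{i}" for i in range(9, -1, -1)]


def insert_new_post(feed, offset):
    # A new post appears right after the first page was fetched.
    if offset == 5:
        feed.insert(0, "p10")


def delete_newest_post(feed, offset):
    # The newest post is deleted right after the first page was fetched.
    if offset == 5:
        feed.pop(0)


stable = crawl(make_feed(), 5)
with_insert = crawl(make_feed(), 5, insert_new_post)
with_delete = crawl(make_feed(), 5, delete_newest_post)

print(len(stable), len(set(stable)))                   # 10 10
print("p10" in with_insert, with_insert.count("p5"))   # False 2
print("p4" in with_delete)                             # False
```

In this model, an insertion during the crawl makes the crawler see one post twice and miss the new one, while a deletion makes it skip an existing post entirely. Either way, two crawls of the "same" feed can return different sets.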
#3
Tumblr.com has a lot of server-side bugs, so in general I wouldn't trust anything that it returns.
According to my logs, that blog contains 12942 posts though:
- A post can contain multiple downloadable items
- The total post count also includes deleted posts
I got 12893 results: 12882 tumblr.com items and 14 youtube items. After adding/crawling the same link again, I got 39 more items, though I wasn't able to find out where they came from. Maybe they were added in the meantime.
__________________
JD Supporter, Plugin Dev. & Community Manager
First Steps & Tutorials || JDownloader 2 Setup Download
#4
Hi,
thanks for the quick response. I did some more testing, and it seems that at least the question of where the additional files from "Check online status" came from is solved. It turns out that files in the "Others" category, normally files like gifv or pnj that serve as placeholders for removed images, are actually found, and the image counter increases accordingly.
I still got different crawl results after two tries. Even though the difference is relatively small, it tells me the crawl is incomplete, and I don't really know what is missing. For this test I only focused on images on tumblr.com that are actually online.
Test #1
Simple crawl:
- offline files: 3
- file types: audio file: 3, document file: 5, image: 10626, others: 101, video file: 2187
- hoster: tumblr.com: 12908, youtube.com: 14
After "online status check":
- offline files: 3
- file types: audio file: 3, document file: 5, image: 10727, others: 0, video file: 2187
- hoster: tumblr.com: 12908, youtube.com: 14
10724 images on tumblr.com online
Downloaded: 10713 jpg, 1 png, 10 webp
Test #2
Simple crawl:
- offline files: 3
- file types: audio file: 3, document file: 5, image: 10627, others: 100, video file: 2187
- hoster: tumblr.com: 12908, youtube.com: 14
After "online status check":
- offline files: 3
- file types: audio file: 3, document file: 5, image: 10727, others: 0, video file: 2187
- hoster: tumblr.com: 12908, youtube.com: 14
10724 images on tumblr.com online
Downloaded: 10713 jpg, 1 png, 10 webp
The same number of images was downloaded, but the two resulting folders differ by 54 files. Some files are present in the first crawl and missing in the other, and vice versa.
Last edited by yk1649; 13.04.2024 at 01:05.
#5
P.S. Nothing was added to the blog while I was testing.
#6
Maybe the images are different because different CDN servers have the image in differently compressed versions, and/or because of other factors. You could check the files by their filename and then look up the download URL in JDownloader via the right-click menu. Please understand that the better you can reproduce and/or narrow down the cause, the easier/faster we can find/fix it.
__________________
JD-Dev & Server-Admin
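One way to test the "different compressed versions" idea is to compare file contents rather than filenames. The following is a minimal sketch, assuming both crawls were downloaded into two separate folders; `sha256_of` and `changed_files` are hypothetical helper names, and only the Python standard library is used. It lists files that exist under the same name in both folders but whose bytes differ.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(first_dir, second_dir):
    """Return names present in both folders whose contents differ."""
    first_dir, second_dir = Path(first_dir), Path(second_dir)
    first_names = {p.name for p in first_dir.iterdir() if p.is_file()}
    second_names = {p.name for p in second_dir.iterdir() if p.is_file()}
    common = first_names & second_names
    return sorted(
        name for name in common
        if sha256_of(first_dir / name) != sha256_of(second_dir / name)
    )
```

Calling `changed_files("crawl_1", "crawl_2")` with two hypothetical folder names returns an empty list if every shared filename has byte-identical content. A non-empty list would point towards the same file being served in different versions.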
#7
Hi,
I crawled the same blog another two times and came up with 10724 images. 52 of those aren't in both crawls, going by their filenames: 26 are in the first crawl and not in the second, and vice versa.
I did a folder comparison with meld and noticed an interesting pattern. The files are sorted alphabetically, and the missing files seem to appear in pairs, though that's not always the case, at least in other crawls I did.
I used a shell script to extract those files, and they are all unique, meaning no image is present in different resolutions, which is what I would expect to see if different resolutions of the same image had been downloaded.
It would help a lot if JDownloader had a function to export the download list as a text file.
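The filename comparison described above can also be sketched in Python. This is not the shell script that was actually used, just a minimal illustration; `compare_crawls` is a hypothetical name and the two folders are assumed to hold the results of the two crawls.

```python
from pathlib import Path


def compare_crawls(first_dir, second_dir):
    """Return (only_in_first, only_in_second) as sorted lists of filenames."""
    first = {p.name for p in Path(first_dir).iterdir() if p.is_file()}
    second = {p.name for p in Path(second_dir).iterdir() if p.is_file()}
    return sorted(first - second), sorted(second - first)
```

For example, `only_1, only_2 = compare_crawls("crawl_1", "crawl_2")` (hypothetical folder names) gives the two lists of filenames that are unique to each crawl; their lengths correspond to the "26 and 26" figures reported in this post.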
#8
P.S. Embedding images doesn't seem to work, or I missed something.
**External links are only visible to Support Staff** **External links are only visible to Support Staff**
#9
I took a deep dive into this, and the issue is that the Tumblr API is inconsistent. Crawling the same 12908 items twice does return mostly the same files/links, BUT between two runs there are random results that were not part of the previous run. For example, on page 226 there are now items that were not part of the entire previous run, and again on page 276, and so on. It is totally random but happens more often after 50% progress.
Because the total number of posts is steadily increasing, this could also be caused by new items being inserted during pagination, so that new items sometimes show up during crawling and sometimes don't. During my testing the total number of posts did increase, meaning that a crawl will not be able to find everything, as some posts might already be missing. I don't see any real solution except multiple crawling runs.
__________________
JD-Dev & Server-Admin
Last edited by Jiaz; 14.04.2024 at 01:07.
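The "multiple crawling runs" workaround amounts to taking the union of the links found in each run. The sketch below assumes the links of each run are available as plain lists of strings; how to get them out of JDownloader is not covered here. `merge_runs` and the example URLs are hypothetical.

```python
def merge_runs(*runs):
    """Return the ordered union of all runs and the number of new links per run."""
    merged = {}          # dict keeps first-seen order and drops duplicates
    added_per_run = []
    for run in runs:
        before = len(merged)
        for link in run:
            merged.setdefault(link, None)
        added_per_run.append(len(merged) - before)
    return list(merged), added_per_run


run_1 = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
run_2 = ["https://example.com/b.jpg", "https://example.com/c.jpg"]
links, added = merge_runs(run_1, run_2)
print(links)
print(added)  # [2, 1]
```

If several consecutive runs add nothing new, the merged set has probably stabilised, although with an inconsistent API that is a heuristic and not a guarantee of completeness.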
#10
I told you that their WebAPI is very unreliable.
The only real solution would be to report this to tumblr.com, but I'm unsure how open they are towards crawler applications such as JDownloader. I've updated the prefix of this thread accordingly.
__________________
JD Supporter, Plugin Dev. & Community Manager
#11
Ok,
thanks for the help. I'll try to crawl the blogs more than once. I opened another thread about another issue I have with Tumblr. Not sure if it is related.
#12
@yk1649: Add/crawl, then wait a few minutes and crawl the same link again. In my tests, a few *new* files/entries did show up then.
__________________
JD-Dev & Server-Admin