JDownloader Community - Appwork GmbH
 

#1 | 12.04.2024, 01:17 | yk1649 (Baby Loader)
Join Date: Apr 2024 | Posts: 8
[NOT a JD bug] Tumblr.com crawling causes random results

Hi,

I'm trying to crawl all pictures and videos of a few Tumblr blogs I'm following.

Like: **External links are only visible to Support Staff****External links are only visible to Support Staff**

I noticed that the file size of the resulting folder isn't constant, even though nothing on the blog seems to have been added or removed.

For example: JDownloader shows me there are 10706 pictures online.
A download results in a folder with 10706 pictures.
If I do that again, I'll get another folder with 10706 pictures.
But the content isn't exactly the same.

It's kind of random: several pictures are present in the first crawl but not in the second, and vice versa.

I would assume that I should get the same result every time.

If I use the "Check online status" function, JDownloader adds extra files instead of just verifying the online state of the files already present.

#2 | 12.04.2024, 11:06 | Jiaz (JD Manager)
Join Date: Mar 2009 | Location: Germany | Posts: 79,532

@yk1649: "Check online status" cannot add more files, as there is absolutely no support/function for this. All it does is check the selected file(s); it cannot add more files. More likely, crawling is still going on (see the icon in the bottom right corner) and JDownloader is still crawling/adding files.

It can also be that the site's pagination is unstable and does not return the same items on a new loop/walk-through. You can test this by adding the links, waiting until crawling is complete, then adding them again and checking whether the number of items in the LinkGrabber increases.
__________________
JD-Dev & Server-Admin

Last edited by Jiaz; 12.04.2024 at 11:29.

#3 | 12.04.2024, 12:33 | pspzockerscene (Community Manager)
Join Date: Mar 2009 | Location: Germany | Posts: 71,103

Tumblr.com has a lot of serverside bugs so in general, I wouldn't trust anything that it returns.

According to my logs, that blog contains 12942 posts though:
- A post can contain multiple downloadable items
- The total post count also includes deleted posts

I got 12893 results: 12882 tumblr.com items and 14 youtube items.

After adding/crawling the same link again, I got 39 more items, though I wasn't able to find out where they came from. Maybe they were added in the meantime.
__________________
JD Supporter, Plugin Dev. & Community Manager

First Steps & Tutorials || JDownloader 2 Setup Download
Spoiler:

A user's JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz
Do you have Nero installed?

#4 | 13.04.2024, 01:01 | yk1649 (Baby Loader)

Hi,

thanks for the quick response.

I did some more testing, and it seems that at least the question of where
the additional files from "Check online status" came from is solved.
It turns out that files in the "Others" category,
normally files like gifv or pnj that act as placeholders for removed images,
turn out to be images after the check, so the image counter increases.

I still got different crawl results after two tries,
even though the difference is relatively small.
It just tells me the crawl is incomplete, and I don't really know what's missing.

For this test I only focused on images on tumblr.com that are actually online.

Test #1

simple crawl:

offline files: 3

file types
audio file: 3
document file: 5
image: 10626
others: 101
video file: 2187

hoster
tumblr.com: 12908
youtube.com: 14

after "online status check":

offline files: 3

file types
audio file: 3
document file: 5
image: 10727
others: 0
video file: 2187

hoster
tumblr.com: 12908
youtube.com: 14

10724 images on tumblr.com online

downloaded:
10713 jpg
1 png
10 webp

Test #2

simple crawl:

offline files: 3

file types
audio file: 3
document file: 5
image: 10627
others: 100
video file: 2187

hoster
tumblr.com: 12908
youtube.com: 14

after "online status check":

offline files: 3

file types
audio file: 3
document file: 5
image: 10727
others: 0
video file: 2187

hoster
tumblr.com: 12908
youtube.com: 14

10724 images on tumblr.com online

downloaded:
10713 jpg
1 png
10 webp

The same number of images was downloaded both times.
But the two resulting folders differ by 54 files.
Some files are present in the first crawl and missing in the second, and vice versa.
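
As a quick sanity check on the numbers from Test #1, here is a small Python sketch. The counts are copied from the lists above; the variable names are only for illustration. It shows that the totals match before and after the status check, and that the 101 "others" entries account exactly for the increase in images.

```python
# Counts copied from Test #1 (simple crawl vs. after "online status check").
before = {"audio": 3, "document": 5, "image": 10626, "others": 101, "video": 2187}
after = {"audio": 3, "document": 5, "image": 10727, "others": 0, "video": 2187}
hosters = {"tumblr.com": 12908, "youtube.com": 14}

total_before = sum(before.values())
total_after = sum(after.values())
total_hosters = sum(hosters.values())

# Entries that left "others" and entries that were added to "image".
left_others = before["others"] - after["others"]
moved_to_image = after["image"] - before["image"]

print(total_before, total_after, total_hosters)
print(left_others, moved_to_image)
```

So the status check does not add files here; it only reclassifies entries that were already in the list.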

Last edited by yk1649; 13.04.2024 at 01:05.

#5 | 13.04.2024, 01:02 | yk1649 (Baby Loader)

P.S. Nothing was added to the blog while I was testing.

#6 | 13.04.2024, 12:49 | Jiaz (JD Manager)

Quote:
Originally Posted by yk1649
Same amount of images has been downloaded.
But the two resulting folder differ by 54 files.
Do you mean that 54 files are different? Or that the number of files differs by 54?

Maybe the images are different because different CDN servers hold the image in differently compressed versions,
and/or because of other factors. You could check the files by their filename and then look up the
download URL in JDownloader via the right-click menu. Please understand that the better you can reproduce
and/or narrow down the cause, the easier/faster we can find/fix it.

#7 | 13.04.2024, 21:50 | yk1649 (Baby Loader)

Hi,

I crawled the same blog another two times and came up with 10724 images.

52 of those aren't in both crawls, going by their filenames.

26 are in the first crawl but not in the second, and vice versa.

I did a folder comparison with Meld and noticed an interesting pattern.



The files are sorted alphabetically, and the missing files seem to appear in pairs. Though that's not always the case, at least in other crawls I did.

I used a shell script to extract those files and they are all unique,
meaning no image is present in different resolutions.
Duplicates like that are what I would expect if merely different resolutions of the same image had been downloaded.
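
For anyone who wants to repeat the filename comparison without Meld, here is a minimal Python sketch. It is not the shell script mentioned above; the function name `compare_folders` and the folder names in the usage note are made up for illustration. It only compares filenames, not file contents.

```python
from pathlib import Path


def compare_folders(first: Path, second: Path):
    """Return (only_in_first, only_in_second) as sorted lists of filenames."""
    names_first = {p.name for p in first.iterdir() if p.is_file()}
    names_second = {p.name for p in second.iterdir() if p.is_file()}
    only_in_first = sorted(names_first - names_second)
    only_in_second = sorted(names_second - names_first)
    return only_in_first, only_in_second
```

Usage would look like `only_a, only_b = compare_folders(Path("crawl1"), Path("crawl2"))`, where `crawl1` and `crawl2` are hypothetical download folders. With the numbers from this post, both lists would have 26 entries.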

It would help a lot if JDownloader had a function to export the download list as a text file.

#8 | 13.04.2024, 21:51 | yk1649 (Baby Loader)

P.S. Embedding images doesn't seem to work, or I missed something.

**External links are only visible to Support Staff****External links are only visible to Support Staff**

#9 | 14.04.2024, 00:36 | Jiaz (JD Manager)

I dove deep into this, and the issue is that the Tumblr API is inconsistent. Crawling the same 12908 items twice does return mostly the same files/links, BUT between two runs there are random results that were not part of the previous run. For example, on page 226 there are now items that were not part of the whole previous run, and likewise on page 276 and so on. It is totally random but happens more often after 50% progress.

Because the total number of posts is steadily increasing, this could also be caused by new items being inserted during pagination, so that new items sometimes show up during crawling and sometimes don't. During my testing the total number of posts did increase, meaning that the crawl will not be able to find everything, as some posts might already be missing.
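
To illustrate the "new items inserted during pagination" idea, here is a toy Python simulation. It is only a sketch under assumptions: it models a generic offset-based, newest-first listing, not the actual Tumblr API or the JDownloader plugin, and `Blog`, `crawl` and the post names are made up.

```python
class Blog:
    """Toy newest-first post list that can gain one post in the middle of a crawl."""

    def __init__(self, posts, insert_before_call=None):
        self.posts = list(posts)
        self.calls = 0
        self.insert_before_call = insert_before_call

    def fetch_page(self, offset, limit):
        self.calls += 1
        if self.calls == self.insert_before_call:
            # A new post appears at the top and shifts all older posts by one.
            self.posts.insert(0, "new")
        return self.posts[offset:offset + limit]


def crawl(fetch_page, page_size):
    """Walk offset-based pages until an empty page comes back."""
    seen, offset = [], 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return seen
        seen.extend(page)
        offset += page_size


posts = ["p6", "p5", "p4", "p3", "p2", "p1"]
run_stable = crawl(Blog(posts).fetch_page, 2)
run_drift = crawl(Blog(posts, insert_before_call=2).fetch_page, 2)

print(run_stable)
print(run_drift)
```

In the drifting run, "p5" is returned twice and "new" is never returned, although the crawler did nothing different. Whether this is what really happens on Tumblr's side is not confirmed here.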

I don't see any real solution except multiple crawling runs

Last edited by Jiaz; 14.04.2024 at 01:07.

#10 | 15.04.2024, 16:18 | pspzockerscene (Community Manager)

I told you that their WebAPI is very unreliable.

The only real solution would be to report this to tumblr.com, but I'm unsure how open they are towards crawler applications such as JDownloader.
I've updated the prefix of this thread accordingly.

#11 | 15.04.2024, 21:22 | yk1649 (Baby Loader)

Ok,

Thanks for the help. I'll try to crawl the blogs more than once.

I opened another thread about another issue I have with Tumblr. Not sure if it is related.

#12 | 15.04.2024, 21:32 | Jiaz (JD Manager)

@yk1649: Add/crawl, then wait a few minutes and crawl the same link again. In my tests a few *new* files/entries did show up then.
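
If the results of several crawl runs are available as lists of filenames or URLs, the "crawl more than once" workaround boils down to taking the union of the runs. A minimal Python sketch follows; `merge_runs` and the sample filenames are hypothetical, and JDownloader itself is not involved.

```python
def merge_runs(runs):
    """Union several crawl results and count how many new entries each run added."""
    merged = set()
    new_per_run = []
    for run in runs:
        found = set(run)
        new_per_run.append(len(found - merged))
        merged |= found
    return merged, new_per_run


run1 = ["a.jpg", "b.jpg", "c.jpg"]
run2 = ["a.jpg", "c.jpg", "d.jpg"]
run3 = ["a.jpg", "b.jpg", "d.jpg"]

merged, new_per_run = merge_runs([run1, run2, run3])
print(sorted(merged))
print(new_per_run)
```

When a further run adds 0 new entries, that is a hint (not a proof) that the merged list is close to complete.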

All times are GMT +2.