JDownloader Community - Appwork GmbH
 

  #1  
Old 29.12.2019, 17:08
doiulou doiulou is offline
Baby Loader
 
Join Date: Dec 2019
Posts: 6
Default Twitter - Scrape date/time limit?

Sorry if I am posting this in the wrong section or am using JDownloader wrong.

I am trying to scrape a Twitter profile (all its media), but with JDownloader I cannot get any further back than 19 March 2019. Everything before March 2019 is not downloaded. How can I download these?

Kind regards,

dioulou
  #2  
Old 13.01.2020, 13:43
Akasen Akasen is offline
DSL Light User
 
Join Date: Jul 2017
Posts: 34
Default Twitter - Linkgrabber not doing complete scrape

So I've noticed something recently that's absolutely troubling and I don't know if this has been noted.

While I'm painfully aware of another issue plaguing the Twitter side of things, I feel that at the very least I should do my part by reporting these things when I notice them.

What I'm finding is that when the links are being grabbed, the Linkgrabber only goes so far back into the past of uploaded files. It's honestly not easy to pin down how far JDownloader goes, or why, but it just does.

So for example on my end, downloading the media of this artist is NEARLY complete, but somehow the last three images (not counting the YouTube video posted) just don't get picked up by the Linkgrabber:

**External links are only visible to Support Staff****External links are only visible to Support Staff**

And a log
13.01.20 06.17.42 <--> 13.01.20 06.43.06 jdlog://9377330900751/

This test can be repeated across other twitter accounts as well.
  #3  
Old 14.01.2020, 01:26
Akasen Akasen is offline
DSL Light User
 
Join Date: Jul 2017
Posts: 34
Default

Funny you're posting this, cause I noticed a similar thing as well. In my case, I was also noticing with someone who hadn't posted since 2017, I wasn't able to get all the stuff from them.

https://board.jdownloader.org/showthread.php?t=82777
  #4  
Old 20.01.2020, 21:19
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

The Twitter web API seems to have a limit, which you should even be able to see via browser.
An old note of mine in our code says the same:
Code:
            /* 2016-11-30: Seems like twitter limits their website to a max "load more" calls of 40. */
40 would mean 40 times loading content by scrolling down.

As said - you can simply test this via browser.

I'm not doing this right now as I don't have the time for that (today).

-psp-
__________________
JD Supporter, Plugin Dev. & Community Manager

Getting Started & Tutorials || JDownloader 2 Setup Download
Spoiler:

A users' JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?
  #5  
Old 10.04.2020, 01:31
doiulou doiulou is offline
Baby Loader
 
Join Date: Dec 2019
Posts: 6
Default

Hi,

Basically the Linkgrabber never fetches all links.

This is on Windows 10 with the latest version of JDownloader.

10.04.20 06.06.28 <--> 10.04.20 06.30.38 jdlog://0684815302851/

I tried to grab all links several times during the period covered by the log above. So there are no download issues; it's just that not all links are fetched.

Last edited by doiulou; 10.04.2020 at 01:34.
  #6  
Old 12.04.2020, 14:45
Coldblackice Coldblackice is offline
Wind Gust
 
Join Date: Sep 2019
Location: San Francisco
Posts: 40
Default

I'm also having a similar issue. A site I'm trying to download links from happens to have some links to Twitter in its source. This leads to an endless rabbit hole that the Linkcrawler never finishes. I've tried adding filters against twitter.com and twimg.com, but the Linkcrawler still churns away endlessly. Is there something I'm missing?
  #7  
Old 12.04.2020, 17:31
doiulou doiulou is offline
Baby Loader
 
Join Date: Dec 2019
Posts: 6
Default

In my case it does finish crawling and proceeds with downloading. It just quits finding links after a certain date.
  #8  
Old 13.04.2020, 09:51
Coldblackice Coldblackice is offline
Wind Gust
 
Join Date: Sep 2019
Location: San Francisco
Posts: 40
Default

Quote:
Originally Posted by doiulou View Post
In my case it does finish crawling and proceeds with downloading. It just quits finding links after a certain date.
Are you able to see how many links it is able to grab? Does this number stay the same between restarts of JDownloader?
  #9  
Old 13.04.2020, 19:32
doiulou doiulou is offline
Baby Loader
 
Join Date: Dec 2019
Posts: 6
Default

Quote:
Originally Posted by Coldblackice View Post
Are you able to see how many links it is able to grab? Does this number stay the same between restarts of JDownloader?
I believe so yes!
  #10  
Old 14.04.2020, 09:09
Akasen Akasen is offline
DSL Light User
 
Join Date: Jul 2017
Posts: 34
Default

Hi, I thought I'd chime in here since this thread is still alive and I essentially got the answer to this problem from PSP in another thread:

Quote:
Originally Posted by pspzockerscene View Post
Hi again Akasen,

I've investigated this.

It seems like twitter has a max. number of tweets you can see/go back.
Code:
641|twitter.com_jd.plugins.decrypter.TwitterCom 31.03.20 15:36:59 - INFO [ jd.plugins.decrypter.TwitterCom(crawlUserViaAPI) ] -> Numberof tweets on current page: 0 of expected max 20
641|twitter.com_jd.plugins.decrypter.TwitterCom 31.03.20 15:37:01 - INFO [ jd.plugins.decrypter.TwitterCom(crawlUserViaAPI) ] -> Numberof total tweets crawled: 829 of expected total 2748
By default, twitter will return 20 tweets per page --> Every tweet may contain a different amount of downloadable media.

Basically we noticed this in the past as well.
You can even check this via browser by going back as far as 41 "pages" which means 40x reloading by scrolling down.
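As a rough sanity check of the numbers above (a sketch, not something taken from the plugin): 41 pages at the default 20 tweets per page gives about 820 tweets, which is close to the 829 tweets the log reports as crawled. The function and constant names below are made up for illustration; the 20-per-page and 41-page figures are simply the ones quoted in this post, and the small gap to 829 is not explained by this estimate.

```python
# Figures quoted in the post above; the names are hypothetical.
TWEETS_PER_PAGE = 20  # log: "expected max 20"
MAX_PAGES = 41        # first page plus 40 "load more" calls


def reachable_tweets(pages: int = MAX_PAGES, per_page: int = TWEETS_PER_PAGE) -> int:
    """Upper-bound estimate of how many tweets a page-limited crawl can reach."""
    return pages * per_page


estimate = reachable_tweets()
crawled, expected_total = 829, 2748  # from the log lines above

print(f"estimated reachable tweets: {estimate}")
print(f"crawled per log: {crawled} of {expected_total}")
print(f"share of the profile reached: {crawled / expected_total:.1%}")
```

So with these figures only roughly the newest third of that profile is reachable, regardless of how many tweets the account has in total.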

Another issue is that a lot of websites would allow you to e.g. "start at position 500".
Twitter, however, uses so-called "cursors": to access the next page, you need a token that is only available on the previous page. So even if I wanted to, I would not be able to give you an option to e.g. start at position 800 in this case.
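To illustrate why cursors rule out a "start at position 800" option, here is a minimal sketch using a made-up in-memory page store. `PAGES`, `fetch_page` and `crawl` are hypothetical names, not Twitter's API and not JDownloader's code. The only point is that each page can be requested only with the token returned by the page before it, so a cap on the number of page requests also caps how far back a crawl can reach.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a server: every page is keyed by the cursor
# handed out by the previous page. None is the cursor of the first page.
PAGES: Dict[Optional[str], Tuple[List[str], Optional[str]]] = {
    None: (["tweet-1", "tweet-2"], "cursor-a"),
    "cursor-a": (["tweet-3", "tweet-4"], "cursor-b"),
    "cursor-b": (["tweet-5"], None),  # last page: no further cursor
}


def fetch_page(cursor: Optional[str]) -> Tuple[List[str], Optional[str]]:
    """Return (items, next_cursor) for the page behind the given cursor."""
    return PAGES[cursor]


def crawl(max_pages: int) -> List[str]:
    """Walk the pages in order, stopping at the end or at the page cap."""
    items: List[str] = []
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page_items, cursor = fetch_page(cursor)
        items.extend(page_items)
        if cursor is None:  # nothing left to load
            break
    return items


print(crawl(max_pages=10))  # reaches every page
print(crawl(max_pages=2))   # page cap hit before the end
```

There is no way to jump straight to "cursor-b" without first loading the page that hands it out, which is the difference from offset-based paging.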

For your other URL, it finds 200 objects which should be all as it only contains 162 tweets.

I could now e.g. experiment and display more objects per page, but although this may return some more objects, we would run into similar issues with URLs containing even more objects.

I recommend you to:
- Test via browser and see how far you can get and if you can e.g. get more than JD does
- Search the Internet for other Twitter downloader tools --> If you find one that does a better job than our crawler, let me know and I'll look into it again

-psp-
My experience so far in reporting these issues to PSP in another thread has been one of "steps forward, some steps back" with regard to Twitter. The API is just weird, and Twitter is likely constantly working against the efforts of tools like JDownloader.

Quote:
Originally Posted by pspzockerscene View Post
Hm as said, I get 200 items when I add that one of your two URLs.
Now I even get 201.

According to the github tickets of this other software, the twitter API is kinda random.

Unfortunately I do not have the time to do big experiments on it, and it seems to be working fine for most if not all of our current users, so I do not want to add experimental code.

According to the tickets, changing the "filter" values and also the User-Agent may bring more results.

We are open source so if you want you can grab our code and play around with it:
**External links are only visible to Support Staff**...

-psp-
The best thing that can probably be done at this point is for a large group of people interested in maintaining the Twitter plugin and gaining insight into it to take up the code and start documenting and experimenting with it. I have the JDownloader code downloaded myself, but I'm not able to focus entirely on figuring out and experimenting with JDownloader and Twitter.

Last edited by Akasen; 14.04.2020 at 09:14.
  #11  
Old 17.04.2020, 17:34
doiulou doiulou is offline
Baby Loader
 
Join Date: Dec 2019
Posts: 6
Default

I partially understand, as I've implemented quite a few APIs.

I guess you have to scrape page by page instead of going through the API. Although I have a hunch you can do it through the API?

You can e.g. visit the URL:
twitter/search?q=(from%3ABarackObama)%20until%3A2019-03-20%20since%3A2019-01-01&src=typed_query

Which will give you Barack Obama's tweets between these two dates, even if these dates are well past what the API allows. If you can just provide the above parameters to twitter.com/search... you can scrape any kind of page?
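For reference, a small sketch of how such a search URL could be put together. The `search_url` helper is hypothetical, and the `https://twitter.com/search` base is an assumption based on the twitter.com/search mention above. It only builds the string and does not make any request.

```python
from urllib.parse import quote


def search_url(handle: str, since: str, until: str,
               base: str = "https://twitter.com/search") -> str:
    """Build a search URL for one account and one date range.

    The base URL is an assumption; dates are plain YYYY-MM-DD strings.
    """
    query = f"(from:{handle}) until:{until} since:{since}"
    # Keep the parentheses literal; ':' becomes %3A and ' ' becomes %20.
    encoded = quote(query, safe="()")
    return f"{base}?q={encoded}&src=typed_query"


print(search_url("BarackObama", since="2019-01-01", until="2019-03-20"))
```

The query part it produces has the same shape as the example URL above.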
  #12  
Old 20.04.2020, 14:04
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

Do you have a working example for me?

-psp-
  #13  
Old 24.04.2020, 01:12
Akasen Akasen is offline
DSL Light User
 
Join Date: Jul 2017
Posts: 34
Default

I have been checking back to see if doiulou ever expanded on their idea in any way, explanation or example of code.

I assume what they have in mind is that, rather than having JDownloader look only at a user profile and make the calls from there, JDownloader would do the scraping in controlled searches by date, between now and whenever the profile first started.

It does appear that Twitter allows this from the public-facing side of things as it stands, so it stands to reason that if you take the Twitter username and do a series of "(from:[USERNAME-HANDLE]) until:END_DATE since:START_DATE" searches, working your way down through the dates, you might be able to get around Twitter's current limit on the API and get the links for all of the media on an account.

Of course, that's assuming the Twitter devs haven't already thought of that.

I'm merely trying to expand on what might have been on the table, though. I don't know much about APIs myself, nor do I have an intimate understanding of Twitter's.

For all I know, doiulou might have been going in a different direction. What I do know, though, is that you can't do a search over a time span that's too large, like trying to look through tweets between 2018-01-01 and 2019-01-01. Anything less than maybe 300 days seems to work just fine, but I'm probably going into unnecessary detail at the moment.
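To make the idea a bit more concrete, here is a rough sketch of splitting a profile's lifetime into search windows. The function names are made up, the 300-day window is only my rough observation from above rather than a documented limit, and nothing here talks to Twitter. It just generates the query strings, newest window first.

```python
from datetime import date, timedelta
from typing import Iterator, List, Tuple


def date_windows(start: date, end: date,
                 max_days: int = 300) -> Iterator[Tuple[date, date]]:
    """Yield (since, until) pairs, newest first, covering start..end."""
    until = end
    while until > start:
        # Never step back past the requested start date.
        since = max(start, until - timedelta(days=max_days))
        yield since, until
        until = since


def search_queries(handle: str, start: date, end: date,
                   max_days: int = 300) -> List[str]:
    """One "(from:HANDLE) until:END since:START" query per window."""
    return [
        f"(from:{handle}) until:{until.isoformat()} since:{since.isoformat()}"
        for since, until in date_windows(start, end, max_days)
    ]


for query in search_queries("BarackObama", date(2018, 1, 1), date(2019, 3, 20)):
    print(query)
```

Each window starts where the previous one ended, so the ranges are contiguous. Whether tweets on the boundary dates show up in both searches would have to be tested, so a real crawler should probably de-duplicate the results.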
  #14  
Old 11.05.2020, 23:30
doiulou doiulou is offline
Baby Loader
 
Join Date: Dec 2019
Posts: 6
Default

Sorry I'm not in this forum all the time but yeah basically that is what I thought.

Just scrape over date ranges....
  #15  
Old 12.05.2020, 15:56
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

We are using this API (= website) at the moment:

Code:
api.twitter.com/2/timeline/media/CENSORED.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1[...]
As said, we are open source and code can be contributed ...
Does this web API accept the above-mentioned parameters?

-psp-
  #16  
Old 23.09.2020, 18:23
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 70,922
Default

Small update:
It seems there is an open source tool out there that can crawl beyond the limits:
github.com/twintproject/twint

I haven't investigated it (yet), and I'm also unsure whether I'll find enough time for that.
Either way, maybe this tool is helpful for those who were looking for a way to go beyond this limit.

-psp-
EDIT

Ticket:

Last edited by pspzockerscene; 23.09.2020 at 18:57.
All times are GMT +2. The time now is 12:08.