#1
Twitter - Scrape date/time limit?
Sorry if I am posting this in the wrong section or am using JDownloader wrong.
I am trying to scrape a Twitter profile (all its media), but with JDownloader I cannot get any further back than 19 March 2019. Everything before March 2019 is not downloaded. How can I download these older items?
Kind regards, dioulou
#2
Twitter - Linkgrabber not doing complete scrape
So I've noticed something recently that's absolutely troubling and I don't know if this has been noted.
While I'm painfully aware of another issue plaguing the Twitter side of things, I feel that at the very least I should do my part in reporting these things when I notice them. What I'm finding is that when links are being grabbed, the Linkgrabber only goes so far back in the history of uploaded files. It's honestly not easy to pin down how far JDownloader goes, or why, but it stops somewhere.
For example, on my end, downloading the media of this artist is NEARLY complete, but somehow the last three images (not counting the YouTube video posted) just don't get picked up by the Linkgrabber:
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
And a log:
13.01.20 06.17.42 <--> 13.01.20 06.43.06 jdlog://9377330900751/
This test can be repeated across other Twitter accounts as well.
#3
Funny you're posting this, because I noticed a similar thing. In my case it was an account that hadn't posted since 2017, and I wasn't able to get everything from it.
https://board.jdownloader.org/showthread.php?t=82777
#4
The Twitter web API seems to have a limit, which you should even be able to see in a browser.
An old note of mine in our code says the same:
Code:
/* 2016-11-30: Seems like twitter limits their website to a max "load more" calls of 40. */
As said, you can simply test this via browser. I'm not doing this right now as I don't have the time for it today.
-psp-
__________________
JD Supporter, Plugin Dev. & Community Manager
First Steps & Tutorials || JDownloader 2 Setup Download
#5
Hi,
Basically, the Linkgrabber never fetches all links. This is on Windows 10 with the latest version of JDownloader.
10.04.20 06.06.28 <--> 10.04.20 06.30.38 jdlog://0684815302851/
I tried to get all links multiple times during the above log. So there are no download issues; it's just that not all links are fetched.
Last edited by doiulou; 10.04.2020 at 01:34.
#6
I'm also having a similar issue. A site I'm trying to download links from happens to have some links to Twitter in its source. This leads to an endless rabbit hole that the Linkcrawler never finishes crawling. I've tried adding filters against twitter.com and twimg.com, but the Linkcrawler still churns away endlessly. Is there something I'm missing?
#7
In my case it does finish crawling and proceeds with downloading. It just quits finding links after a certain date.
#8
Are you able to see how many links it is able to grab? Does this number stay the same between restarts of JDownloader?
#9
I believe so yes!
#10
Hi, I thought I'd chime in here since this thread is still alive and I essentially got the answer to this problem from PSP in another thread:
Quote:
Quote:
Last edited by Akasen; 14.04.2020 at 09:14.
#11
I partially understand, as I've implemented quite a few APIs.
I guess you have to scrape page by page instead of going through the API. Although I have a hunch you can do it through the API?
You can, e.g., visit the URL:
twitter/search?q=(from%3ABarackObama)%20until%3A2019-03-20%20since%3A2019-01-01&src=typed_query
This will give you Barack Obama's tweets between these two dates, even if the dates are well past what the API allows. If you can just pass the above parameters to twitter.com/search..., could you scrape any kind of page?
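For anyone who wants to try this, here is a minimal Python sketch of how such a date-bounded search URL could be assembled. The base URL `https://twitter.com/search` is an assumption (the post above only writes "twitter/search"), `build_search_url` is a hypothetical helper name, and nothing here is taken from JDownloader's plugin code.

```python
from urllib.parse import quote

# Assumption: the full host is not spelled out in the post above.
SEARCH_BASE = "https://twitter.com/search"


def build_search_url(handle: str, since: str, until: str) -> str:
    """Build a date-bounded search URL for one account (dates as YYYY-MM-DD)."""
    query = f"(from:{handle}) until:{until} since:{since}"
    # Keep the parentheses literal and percent-encode ':' and spaces,
    # which mirrors the encoding of the example URL in the post.
    return f"{SEARCH_BASE}?q={quote(query, safe='()')}&src=typed_query"


print(build_search_url("BarackObama", "2019-01-01", "2019-03-20"))
```

This only builds the URL. Whether the page behind it can be crawled without a browser session is a separate question that this sketch does not answer.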
#12
Do you have a working example for me?
-psp-
#13
I have been checking back to see whether doiulou ever expanded on their idea with an explanation or example code.
I assume what they have in mind is that, rather than having JDownloader look only at a user profile and make the calls from there, JDownloader would scrape in controlled searches by date, from now back to whenever the profile first started. Twitter's public-facing side does appear to allow that as it stands. So it stands to reason that if you take the Twitter username and run a series of "(from:[USERNAME-HANDLE]) until:END_DATE since:START_DATE" searches, working backwards through the dates, you might be able to get around Twitter's current API limit and get the links for all of the media on an account. Of course, that's assuming the Twitter devs haven't already thought of that.
I'm merely trying to expand on what might have been on the table, though. I don't know much about APIs myself, nor do I have an intimate understanding of Twitter's. For all I know, doiulou might have been going in a different direction.
What I do know is that you can't search over a time span that's too large, such as all tweets between 2018-01-01 and 2019-01-01. I do think anything less than maybe 300 days seems to work just fine, but I'm probably going into unnecessary detail at the moment.
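The "go down the dates" idea could be sketched roughly like this in Python. `date_windows` is a hypothetical helper, the 30-day window size is an arbitrary choice that stays well under the roughly 300 days mentioned above, and the query string simply follows the pattern from this post.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(first: date, last: date, max_days: int = 30) -> Iterator[Tuple[date, date]]:
    """Yield (since, until) pairs, walking backwards from `last` to `first`."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    until = last
    while until > first:
        # Never step back past the earliest date of interest.
        since = max(first, until - timedelta(days=max_days))
        yield since, until
        until = since


# Print one search query per window for a placeholder handle.
for since, until in date_windows(date(2019, 1, 1), date(2019, 3, 20)):
    print(f"(from:USERNAME) until:{until} since:{since}")
```

Adjacent windows share a boundary date. How the search treats that boundary has not been checked here, so a real crawler would probably want to de-duplicate the results, for example by tweet ID.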
#14
Sorry, I'm not on this forum all the time, but yeah, basically that is what I thought.
Just scrape over date ranges...
#15
We are using this API (= website) at the moment:
Code:
api.twitter.com/2/timeline/media/CENSORED.json?include_profile_interstitial_type=1&include_blocking=1&include_blocked_by=1[...]
Does this web API accept the above-mentioned parameters?
-psp-
#16
Small update:
It seems there is an open source tool out there that can crawl beyond the limits:
github.com/twintproject/twint
I haven't investigated it yet, and I'm unsure whether I'll find enough time for that. Either way, this tool may be helpful for those who were looking for a way to go beyond this limit.
-psp-
EDIT Ticket:
Last edited by pspzockerscene; 23.09.2020 at 18:57.