JDownloader Community - Appwork GmbH
 

  #1  
Old 31.07.2021, 03:07
XtremeHairball XtremeHairball is offline
Baby Loader
 
Join Date: May 2021
Location: United States
Posts: 5
<jd:prop:date> for Tumblr.com

Hi,

I was wondering if it would be possible to make the <jd:prop:date> tag compatible with Tumblr in the packagizer rules. It would be nice to be able to keep track of how old posts are after downloading them.

The dates seem to be easily visible both on the Archive page of an account as well as in the page source of each individual post. Hopefully that makes this a simple request, although I don't know exactly how hard of a task it is to add packagizer tag compatibility.

For a quick example of a Tumblr archive page: **External links are only visible to Support Staff**

It's very easy to navigate posts by date from this page, and it would be great if that convenience could be taken offline via downloading them with a date in the filename. You can also see the date of specific posts by mousing over them. This specific account also shows the date on each post when you click into one, and all Tumblrs (as far as I can tell) have datePublished inside the page source.

If not, then I'd also be glad to make use of some other workaround if someone could point me in the right direction. I've usually had success with figuring my own workarounds by reading around on the forums, but I haven't been able to find a close reference point on this subject, other than that some websites seem to have had this tag made compatible in the packagizer.
  #2  
Old 02.08.2021, 15:26
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 58,962
Default

Hi
Your supposed tumblr.com example URL is not a tumblr.com URL but leads to "cutepetclub.com".

Please provide real tumblr.com example URLs.
Adding the property should be easily possible.

-psp-
__________________
JD Supporter, Plugin Dev. & Community Manager
JDownloader 2 Setup Download
Spoiler:

A user's JD crashes and the first thing to ask is:
Quote:
Originally Posted by Jiaz View Post
Do you have Nero installed?
That's true James
Quote:
Originally Posted by James
People simply don't understand that just because you can also shoot at people with a gun, a shooting club is not a place for rampage ideas
  #3  
Old 02.08.2021, 16:07
Jiaz's Avatar
Jiaz Jiaz is offline
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 72,273
Default

@pspzockerscene: sounds like he wants the website to be supported?

@XtremeHairball: Is this really about Tumblr or about this archive site? The archive site is not supported, so you would have to crawl/add those yourself.
__________________
JD-Dev & Server-Admin
  #4  
Old 08.08.2021, 18:44
XtremeHairball XtremeHairball is offline
Baby Loader
 
Join Date: May 2021
Location: United States
Posts: 5
Default

Hi,

Sorry for the late response.

I'm also sorry about the issue with the URL. It actually is a Tumblr account, but somehow I failed to consider that "tumblr" wasn't in the URL. Just to clarify, searching the account name from Tumblr will lead to that account's page. Clicking the "Follow" button in the top right corner of the /archive page will add that account to your followed users if you are logged in to Tumblr, otherwise it will take you to the Tumblr login page so that you can follow after logging in. And just for good measure, changing the ".com" portion of the URL to ".tumblr.com" will still lead to the exact same page, but it will then remove the ".tumblr" portion after the page loads. It was definitely a mistake on my part to not notice/consider that when I posted, but I hope it makes some sense how I ended up making that mistake.

My main intent was to find a fun tumblr to use as an example, but I also see now that the Linkcrawler doesn't seem to crawl this one as a normal Tumblr page, I'm guessing due to the absence of "tumblr.com" in the URL. So that whole thing was a big blunder on my part. I'll provide some better examples this time.

Main page and archive pages included (in case that helps for identifying the dates or anything):

**External links are only visible to Support Staff**
**External links are only visible to Support Staff**

**External links are only visible to Support Staff**
**External links are only visible to Support Staff**


And a substitute cute-animals one just to replace the one from before:
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**


I also have one more that doesn't have "tumblr.com" in the URL. I'm not trying to request compatibility for them, but I thought you might like to see that there are other instances where the same thing happens with the URL, although the layouts and archive pages for them are still those of tumblr (and if you check the page source, you can clearly see Tumblr links and assets throughout as well).

**External links are only visible to Support Staff**
**External links are only visible to Support Staff**


I'm not sure why some of them have their own URL as if they were not a part of Tumblr. I've known about that for some time and I just sort of accepted it. Sorry again for the confusion. Hopefully this clears a few things up.
  #5  
Old 09.08.2021, 17:30
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 58,962
Default

Quote:
Originally Posted by XtremeHairball View Post
I'm also sorry about the issue with the URL. It actually is a Tumblr account, but somehow I failed to consider that "tumblr" wasn't in the URL. Just to clarify, searching the account name from Tumblr will lead to that account's page. Clicking the "Follow" button in the top right corner of the /archive page will add that account to your followed users if you are logged in to Tumblr, otherwise it will take you to the Tumblr login page so that you can follow after logging in. And just for good measure, changing the ".com" portion of the URL to ".tumblr.com" will still lead to the exact same page, but it will then remove the ".tumblr" portion after the page loads. It was definitely a mistake on my part to not notice/consider that when I posted, but I hope it makes some sense how I ended up making that mistake.
Yeah, I'm sorry, but neither our plugins nor I could have recognized these as tumblr.com URLs.
Here are the types of tumblr.com URLs (as a "regular expression") that our plugins can currently handle:
Code:
https?://(?![a-z0-9]+\\.media\\.tumblr\\.com/.+)[\\w\\.\\-]+?tumblr\\.com(?:/image/\\d+|/post/\\d+|/likes|/?$|/blog/view/[^/]+(?:/\\d+)?)(?:\\?password=.+)?
At the moment we no longer support "/archive" URLs, but we do support crawling complete tumblr.com user profiles (all posts of a user except text-only posts), so basically all videos and photos.
Does "username.tumblr.com/archive" return content which is not accessible when looking up/crawling everything from "username.tumblr.com/"?
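To check a link against that pattern, a minimal Python sketch follows. It assumes the doubled backslashes in the posted code are Java string-literal escaping, so each `\\` is written as a single `\` here. The blog name `example` and the helper name `is_supported` are made up for illustration, and the thread does not say whether JDownloader applies the pattern to the whole link or only from its start.

```python
import re

# Pattern from the post above with the Java string-literal escaping undone
# (assumption: "\\." in the post stands for the regex token "\.").
SUPPORTED = re.compile(
    r"https?://(?![a-z0-9]+\.media\.tumblr\.com/.+)[\w\.\-]+?tumblr\.com"
    r"(?:/image/\d+|/post/\d+|/likes|/?$|/blog/view/[^/]+(?:/\d+)?)"
    r"(?:\?password=.+)?"
)


def is_supported(url: str) -> bool:
    """Return True if the URL matches the pattern from its first character."""
    return SUPPORTED.match(url) is not None


if __name__ == "__main__":
    # Hypothetical example links; "example" is not a blog from the thread.
    for url in (
        "https://example.tumblr.com/",
        "https://example.tumblr.com/post/123456",
        "https://example.tumblr.com/archive",
        "https://64.media.tumblr.com/abc/file.jpg",
        "https://cutepetclub.com/archive",
    ):
        print(url, is_supported(url))
```

The negative lookahead at the start excludes direct media hosts, and the list of allowed path endings is why "/archive" and custom domains without "tumblr.com" are not picked up.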

After the next update, you can get the date of tumblr.com media via "<jd:prop:date>" but only for URLs added after the update.

Also important:
The tumblr.com website does not seem to expose this information so you will have to add your tumblr.com account to JD first so JD will use their API for crawling which does provide that information.

Are you waiting for an announced bugfix or a new feature?
Updates are not always released immediately!
Please read our Update FAQ!


-psp-
  #6  
Old 12.08.2021, 15:42
XtremeHairball XtremeHairball is offline
Baby Loader
 
Join Date: May 2021
Location: United States
Posts: 5
Default

Hi,

I'll make some brief replies, but I'd also like to mention a workaround that I've patched together, and also a possible bug with the Tumblr plugin after the update.


Quick Replies:
Quote:
Originally Posted by pspzockerscene View Post
Yeah I'm sorry but neither our plugins nor me could know/recognize that these were tumblr.com URLs.
I understand. I just didn't take everything into proper consideration before I posted. Sorry again. I figured out a workaround that seems to work for at least some of those oddly named ones. I'll detail that some more later just in case it helps anyone.


Quote:
Does "username.tumblr.com/archive" return content which is not accessible when looking up/crawling everything from "username.tumblr.com/"?
I don't think that there is any missing content based on the few that I've tried, so that should be fine. I only mentioned the /archive pages because I thought it might be easier to work with since it's so compact and organized.


Quote:
After the next update, you can get the date of tumblr.com media via "<jd:prop:date>" but only for URLs added after the update.
Fantastic. I tried it and it definitely is grabbing the dates.


Quote:
Also important:
The tumblr.com website does not seem to expose this information so you will have to add your tumblr.com account to JD first so JD will use their API for crawling which does provide that information.
I actually went ahead and tried it with my account login disabled, and it did manage to get the dates correctly even without account access.


Possible Bug:
Like I said above, some of the same Tumblrs that I crawled successfully just within the past couple of weeks, now seem to be uncrawlable. I tried to recrawl a couple that I had already crawled previously (in order to get copies with the dates attached), but they don't seem to work anymore. The linkcrawler stops after about one second without returning any results.

Even more confusing is that I actually got the crawler to work on both of the non-standard Tumblr URLs that I mentioned before. I had to make a linkcrawler rule for each one individually, but I think it might be possible to make a universal one also. I'll play with that idea for a bit and post back here if I get something to work (but I might just give up since those are such unusual cases). Basically, I found out that those accounts without .tumblr in the URL actually have alternate URLs that do include .tumblr, so I manually made a crawler rule to replace the URLs with the alternate versions. Then it crawled just like a normal Tumblr page.


Workaround for Grabbing non-Tumblr Tumblrs:
Just for reference in case anyone wants to try something like this themselves, this is the rule that I used to successfully crawl the cutepetclub Tumblr:
Code:
{
  "enabled" : true,
  "cookies" : null,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "id" : 1628747605452,
  "name" : "Tumblr Rewrite",
  "pattern" : "(?:http)?s?://(.*)(cutepetclub).com/(.*)",
  "rule" : "REWRITE",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : null,
  "rewriteReplaceWith" : "**External links are only visible to Support Staff**
}
I know that the regular expressions are a bit sloppy, but they work. I'm not sure how to set up the package naming through linkcrawler rules, so I just made a roundabout fix with the packagizer. Screenshots for the packagizer and the results are included just for reference.
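As a rough illustration of what a REWRITE rule like this does, here is a Python sketch. The replacement string is an assumption (the original is hidden by the forum). It is based only on the earlier observation that changing ".com" to ".tumblr.com" leads to the same blog. `TIGHT` is a hypothetical stricter variant of the posted pattern, not something from the thread.

```python
import re

# The pattern as posted: the dot before "com" is unescaped, so it matches any character.
POSTED = r"(?:http)?s?://(.*)(cutepetclub).com/(.*)"

# Hypothetical tighter variant: explicit scheme, optional "www.", escaped dot.
TIGHT = r"https?://(?:www\.)?(cutepetclub)\.com/(.*)"


def rewrite(url: str) -> str:
    """Rewrite a custom-domain link to the blog's tumblr.com address.

    The target format is an assumption; links that do not match are
    returned unchanged.
    """
    return re.sub(TIGHT, r"https://\1.tumblr.com/\2", url)


if __name__ == "__main__":
    print(rewrite("https://cutepetclub.com/archive"))
    # The posted pattern also accepts a look-alike host, the tight one does not.
    print(bool(re.match(POSTED, "https://cutepetclubXcom/x")))
    print(bool(re.match(TIGHT, "https://cutepetclubXcom/x")))
```

Python writes group references as `\1`, while Java-based tools usually write `$1`. The idea is the same: capture the blog name and path, then rebuild the URL on the tumblr.com host so the normal Tumblr crawler recognizes it.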

I changed the expressions just a tiny bit and changed the URLs to work for another Tumblr. I saw on another forum that these standalone Tumblr-based sites have an actual tumblr URL alternative. The fastest way to find it is by going into the page source and searching the page for "blogname", onto which you can just append the Tumblr domain. The linkcrawler can apparently function as normal.
Code:
{
  "enabled" : true,
  "cookies" : null,
  "updateCookies" : true,
  "logging" : false,
  "maxDecryptDepth" : 1,
  "id" : 1628772989892,
  "name" : "Tumblr Rewrite",
  "pattern" : "(?:http)?s?://(.*)(fairies-fairytales).com/(.*)",
  "rule" : "REWRITE",
  "packageNamePattern" : null,
  "passwordPattern" : null,
  "formPattern" : null,
  "deepPattern" : null,
  "rewriteReplaceWith" : "**External links are only visible to Support Staff**
}
The packagizer rule stayed the same, and the linkcrawler grabbed about 2400 links (and a few stray YouTube videos that weren't packaged correctly).
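Since the two rules differ only in the domain, a small helper could generate one rule per custom domain. The sketch below reuses the field names from the rules above. The helper name `make_rewrite_rule`, the stricter pattern, and the `$1`-style replacement string are assumptions, and the blog name still has to be looked up by hand in the page source as described.

```python
import json


def make_rewrite_rule(rule_id: int, domain: str, blog_name: str) -> dict:
    """Build a REWRITE rule mapping a custom domain to <blog_name>.tumblr.com.

    Field names are copied from the rules posted above; the pattern and the
    replacement format are assumptions, not the poster's hidden originals.
    """
    escaped = domain.replace(".", "\\.")  # treat dots in the domain literally
    return {
        "enabled": True,
        "cookies": None,
        "updateCookies": True,
        "logging": False,
        "maxDecryptDepth": 1,
        "id": rule_id,
        "name": "Tumblr Rewrite",
        "pattern": f"https?://(?:www\\.)?{escaped}/(.*)",
        "rule": "REWRITE",
        "packageNamePattern": None,
        "passwordPattern": None,
        "formPattern": None,
        "deepPattern": None,
        "rewriteReplaceWith": f"https://{blog_name}.tumblr.com/$1",
    }


if __name__ == "__main__":
    # cutepetclub.com -> cutepetclub.tumblr.com follows the observation earlier
    # in the thread; the id value is arbitrary.
    rule = make_rewrite_rule(1, "cutepetclub.com", "cutepetclub")
    print(json.dumps(rule, indent=2))
```

`json.dumps` takes care of doubling the backslashes, which is the step that is easy to get wrong when typing such a rule by hand.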


Summary:
Like I said, even after getting the two odd URLs to work (even with the dates), the standard ones seem to have stopped working. Even when I disable my linkcrawler rules and packagizer rules, they still fail to find anything (or just a few files at most). I'm not sure if this occurred with the update to the date grabbing, or if I've broken something somehow. I just find it strange that it's only the standard URLs that are misbehaving for some reason.
Specifically:
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
Neither of these seems to work anymore, along with a few others. I've also confirmed that the links are not in my downloads list nor in my LinkGrabber list, and they aren't being filtered out in either of those places.

(I'm unsure if this is relevant, but I also don't notice any difference in the filenames when I select the Tumblr plugin option to use "original file names". This is probably unrelated, and maybe I just don't understand what it's meant to do, but I thought I might as well mention it since I'm typing so much right now anyway.)

Other than the plugin no longer working for the normal URLs, everything else seems fine. I'm not sure if it's possible for the linkcrawler to find the alternate "blogName" (it seems to be in several parts of the page source code under different labels, but I don't know if that's accessible), but I'm happy to play around with linkcrawler rules on my own to try to work it out.

Sorry for the long post. Just trying to be comprehensive so as to not cause any confusion this time.


Edit: I tried adding links to a bunch of random Tumblrs. Some of them do work, and some of them don't. I know that I'm supposed to have 20+ examples, so I'll try to test various Tumblrs I find and group them here.

Links that seem to crawl all files (Sometimes too many to manually count to confirm):
Spoiler:

**External links are only visible to Support Staff** (with my own linkcrawler rule)
fairies-fairytales.com (with my own linkcrawler rule)
**External links are only visible to Support Staff** (thousands grabbed)
**External links are only visible to Support Staff** (thousands grabbed)

Edit #2: After pspzockerscene's update, I realize that these were actually not working perfectly either, but everything seems to be fixed now.


Links that crawl few/no files:
Spoiler:

**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**
**External links are only visible to Support Staff**


I have definitely been able to grab Tumblrs successfully very recently. If it's not something gone wrong with the update, then I have no idea what might be the issue.
Attached Images
File Type: png PackRule.png (71.7 KB, 4 views)
File Type: png Pack.png (159.8 KB, 2 views)

Last edited by XtremeHairball; 12.08.2021 at 17:02. Reason: Adding More Example URLs
  #7  
Old 12.08.2021, 16:10
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 58,962
Default

Hi,
1. Your "non-tumblr workaround" via LinkCrawler rule is just fine

2. Indeed, I had assumed that the timestamp was available for every object, which led to an NPE (NullPointerException), so the crawler failed for some URLs.
The next update fixes this and only sets the date when available.
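The kind of fix described in point 2 can be sketched as follows. This is not the plugin's code (the plugin is written in Java). The field names `id` and `timestamp`, the helper names, and the date format are all hypothetical. It only shows the idea of setting the date property when a timestamp is present instead of assuming it always is.

```python
from datetime import datetime, timezone
from typing import Optional


def format_date(timestamp: Optional[int]) -> Optional[str]:
    """Format a Unix timestamp as YYYY-MM-DD, or return None if it is missing."""
    if timestamp is None:
        return None
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")


def build_properties(post: dict) -> dict:
    """Collect per-item properties; the date is only set when available."""
    props = {"post_id": post["id"]}
    date = format_date(post.get("timestamp"))
    if date is not None:
        props["date"] = date
    return props


if __name__ == "__main__":
    print(build_properties({"id": 1, "timestamp": 0}))
    print(build_properties({"id": 2}))  # no timestamp: no "date" key, no crash
```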

Packagenames:
Are you trying to put all items of a user into one package?
Simply add a rule like:
If sourceurl(s) ->
Code:
https?://(.*?)\\.tumblr\\.com/.+
--> And enable RegEx checkbox.
Then set Package Name ->
Code:
<jd:source:1>


-psp-
  #8  
Old 12.08.2021, 19:05
XtremeHairball XtremeHairball is offline
Baby Loader
 
Join Date: May 2021
Location: United States
Posts: 5
Thumbs up

Quote:
Originally Posted by pspzockerscene View Post
2. Indeed, I had assumed that the timestamp was available for every object, which led to an NPE (NullPointerException), so the crawler failed for some URLs.
The next update fixes this and only sets the date when available.
Ah alright. I also would have assumed that every item had a date available. That might stem from my relying too much on the /archive pages though, since there are timestamps there. I don't fully understand how the LinkCrawler works and what information is available to it from the base URL.

And thank you for the lightning-fast fix. You must have posted your reply right as I started editing my previous post. I went through a lot of random Tumblrs for a while to test them, hoping to help with debugging, but you were already finished, haha


Quote:
Originally Posted by pspzockerscene View Post
Packagenames:
Are you trying to put all items of a user into one package?
Simply add a rule like:
If sourceurl(s) ->
Code:
https?://(.*?)\\.tumblr\\.com/.+
--> And enable RegEx checkbox.
Then set Package Name ->
Code:
<jd:source:1>
I actually haven't been able to get this to work. When I entered it exactly as you have, the packagizer didn't find any matches. I tried disabling all of the other fields and just using the two that you suggested, but it then sorts most files into packages called "64.media" and "va.media", which is part of the data URL, not the source/input URL. I'm not sure why that happens. Could the LinkCrawler be collecting them all as new sources? Or I may have just done something wrong.
I tweaked and retried it a few times on a couple of URLs but I kept getting the same result. The URL I used for most of that time was
**External links are only visible to Support Staff**

I did change my previous package naming a bit because it wasn't able to include hyphens in usernames (like the madcat one above). Now I have it set as this:
Package name contains:
Code:
(^\S*)? - (.*)
This is always a match for the default Tumblr package names, so it should always capture the username for later use. Could break if the default package names ever change, but it works well for now.

and as you suggested,
Sourceurl(s) contains:
Code:
https?://(.*?).tumblr.com/.+
Just to make the rule Tumblr-only. I'm not using the captured variable, but I like this much better than the mess I had before.

Package name:
Code:
<jd:orgpackagename:1>
This seems to work well for capturing only the full username

File name:
Code:
<jd:prop:date>_<jd:orgfilename>
Comment:
Code:
<jd:orgpackagename:2>
The default package name includes a bit of the comment/description from each post, so I figure I might as well save it; some of them might be worth saving into files' exif or xmp. Setting the comment this way seems to limit it to a small number of characters. It also makes every letter lowercase and removes all punctuation, but it's still fun to have something rather than nothing.
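The package-name expression above can be tried outside JDownloader with a short Python sketch. The sample package name, the helper names, and the date format are made up for illustration. The default format "username - description" is taken from the description above, and "contains" is assumed to behave like a regex search.

```python
import re
from typing import Optional, Tuple

# Same expression as in the "Package name contains" field.
PACKAGE_NAME = re.compile(r"(^\S*)? - (.*)")


def split_package_name(name: str) -> Optional[Tuple[str, str]]:
    """Return (username, description) or None if the name has no ' - ' part."""
    match = PACKAGE_NAME.search(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


def build_filename(date: str, original: str) -> str:
    """Mimic the <jd:prop:date>_<jd:orgfilename> template (date format assumed)."""
    return f"{date}_{original}"


if __name__ == "__main__":
    print(split_package_name("mad-cat - sleepy kitten photo"))
    print(build_filename("2021-08-12", "tumblr_abc.jpg"))
```

Because `\S*` stops at the first space, hyphens inside the username are kept, and only the " - " separator splits the name into group 1 (username) and group 2 (description).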

But most of that is irrelevant at this point. I'm just glad that things are working mostly the way I wanted them to, excluding just some very minor things like a handful of missing timestamps and some other packagizer things that I may need to work out. Overall, I'm happy.

Thanks so much for the help. I've visited the JD boards many times over the years just to read/troubleshoot, and I'm always impressed with the amount of user support here. I've got a bit of coding experience (nothing professional) so I know that debugging and feature-adding can be a hassle, yet you guys seem attentive to thousands of requests. I just wanted to say that I really respect and appreciate that.

Thanks again
  #9  
Old 13.08.2021, 01:11
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 58,962
Default

Quote:
Originally Posted by XtremeHairball View Post
Ah alright. I also would have assumed that every item had a date available.
Sorry I've made a stupid mistake here!
Indeed it would be quite strange if the date was not available for all items...
After the next update, the date will be available for all items!

Quote:
Originally Posted by XtremeHairball View Post
I actually haven't been able to get this to work. When I entered it exactly as you have, the packagizer didn't find any matches.
The escaping in my rule was also wrong...
Here is a screenshot of a working rule:


Quote:
Originally Posted by XtremeHairball View Post
I just wanted to say that I really respect and appreciate that.
Thanks for your nice feedback

-psp-

Last edited by pspzockerscene; 13.08.2021 at 01:12. Reason: Fixed quotes
  #10  
Old 14.08.2021, 06:22
XtremeHairball XtremeHairball is offline
Baby Loader
 
Join Date: May 2021
Location: United States
Posts: 5
Default

Quote:
Originally Posted by pspzockerscene View Post
Sorry I've made a stupid mistake here!
Indeed it would be quite strange if the date was not available for all items...
After the next update, the date will be available for all items!
Wow. Well, now it works perfectly. Whatever you did, thank you


Quote:
Originally Posted by pspzockerscene View Post
The escaping in my rule was also wrong...
Here is a screenshot of a working rule:
I did notice that the \\ escape doesn't work in JDownloader's packagizer. I assumed that it was just there so that it wouldn't be parsed as a link in your post (since that would make it hidden from me) or that you were just thinking of a different syntax or something where double escapes are sometimes used. I tried to fix it by removing all the backslashes, which didn't work, obviously. In hindsight, I'm not sure why I thought it was a good idea to remove all of them. I'm not fluent in regular expressions, but that was such a simple change that I should have been able to recognize it (I also notice that you replaced the + with a * at the end. I'm not sure if that also made a difference, but either way, it works now)
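The three variants discussed here can be compared in a short Python sketch. The working rule itself is only shown in a screenshot, so `SINGLE` is reconstructed from the description above (single backslashes, `*` instead of `+` at the end) and should be read as an assumption. The URLs are made-up examples, and the thread does not say how the packagizer chooses between several source URLs.

```python
import re

# As first posted: in a regex field, "\\." means a literal backslash plus any character.
AS_POSTED = r"https?://(.*?)\\.tumblr\\.com/.+"
# All backslashes removed: every dot matches any character.
NO_ESCAPES = r"https?://(.*?).tumblr.com/.+"
# Reconstructed working form: single escapes and ".*" at the end.
SINGLE = r"https?://(.*?)\.tumblr\.com/.*"


def first_group(pattern: str, url: str):
    """Return capture group 1 if the pattern matches from the start, else None."""
    match = re.match(pattern, url)
    return match.group(1) if match else None


if __name__ == "__main__":
    profile = "https://example-blog.tumblr.com/"             # hypothetical profile link
    media = "https://64.media.tumblr.com/abc/tumblr_x.jpg"   # hypothetical media link
    for name, pattern in (("AS_POSTED", AS_POSTED),
                          ("NO_ESCAPES", NO_ESCAPES),
                          ("SINGLE", SINGLE)):
        print(name, first_group(pattern, profile), first_group(pattern, media))
```

The doubled backslashes match nothing because no URL contains a backslash. The trailing `.+` requires at least one character after the slash, so a bare profile link ending in "/" does not match, while a media link still does and yields "64.media". That would be consistent with the odd package names seen earlier, and it suggests that the change from `+` to `*` did matter.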

I guess that wraps everything up. Thanks so much again.
  #11  
Old 16.08.2021, 13:39
pspzockerscene's Avatar
pspzockerscene pspzockerscene is offline
Community Manager
 
Join Date: Mar 2009
Location: Deutschland
Posts: 58,962
Default

Thanks for your feedback!

-psp-
All times are GMT +2.