JDownloader Community - Appwork GmbH
 

  #1  
08.04.2019, 11:24
djmakinera
JD Legend
 
Join Date: May 2010
Location: Poland
Posts: 8,297
pattern issue

I wrote the regex below, but it does not work for me.
How can I match titles that are written in Russian, possibly mixed with various other Unicode characters? The match must contain at least one Russian (Cyrillic) character.
It seems complicated and confusing.


^([\p{Cyrillic}]+[\-\.\,\!\…\?\(\)\„\”—0-9\s]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”—0-9\s]*|[\-\.\,\…\?\(\)\„\”—0-9]+[\p{Cyrillic}]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”—0-9\s]*)$

Characters the titles may include:
\p{Cyrillic}
!
?
… (unicode)
— (unicode)
-
.
..
...
,
0-9
(
)
„ (unicode)
” (unicode)
\s (space)
\
/
\x{200b}
*
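A minimal sketch of the requirement described above (only characters from the list, and at least one Cyrillic character), written as a lookahead plus a single character class instead of two long alternatives. It is in Python, not the flavor used in the post, so two things are assumptions: the range U+0400 to U+04FF stands in for \p{Cyrillic}, because Python's built-in re module has no \p{...} classes, and the names TITLE and is_title are made up for illustration. It is meant to be applied to one line at a time.

```python
import re

# Assumption: U+0400-U+04FF (the basic Cyrillic block) stands in for
# \p{Cyrillic}, which Python's built-in re module does not support.
CYRILLIC = "\u0400-\u04FF"

# The other characters from the list above: whitespace, digits, punctuation,
# backslash, slash, asterisk and the zero-width space U+200B.
ALLOWED_EXTRA = r"\s0-9!?….,()„”—\-\\/*\u200b"

TITLE = re.compile(
    rf"(?=.*[{CYRILLIC}])"            # lookahead: at least one Cyrillic character
    rf"[{CYRILLIC}{ALLOWED_EXTRA}]+"  # every character must come from the allowed set
)

def is_title(line: str) -> bool:
    """Return True if the whole line uses only allowed characters and has Cyrillic."""
    return TITLE.fullmatch(line) is not None
```

With this shape, a line of only digits, dashes or dots is rejected by the lookahead, and a line containing Latin letters is rejected by the character class.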
  #2  
09.04.2019, 17:05
Jiaz
JD Manager
 
Join Date: Mar 2009
Location: Germany
Posts: 65,502

Please understand that this is neither a *How do I learn regex* forum nor a *Regex is the solution for everything* forum. Maybe you should think about learning to code instead of relying only on (huge, complex) regexes.
__________________
JD-Dev & Server-Admin
  #3  
09.04.2019, 17:35
djmakinera

I want to match at least one Cyrillic character; otherwise the pattern would also match a long string of digits, dashes, dots and so on.
I do not want to match everything, only the characters I listed.
:confused:


^ = match start of string (or line, in multiline mode)
[] = match character in a set
\p{Cyrillic} = match a Cyrillic character
\- = match a literal -
\. = match a literal .
+ = match previous element one or more times.
* = match previous element zero or more times.
$ = match end of string (or line, in multiline mode)
  #4  
09.04.2019, 18:07
Jiaz

As always, it's much easier if you also provide examples!
  #5  
09.04.2019, 18:18
djmakinera

**External links are only visible to Support Staff****External links are only visible to Support Staff**
  #6  
10.04.2019, 10:07
Jiaz

Thanks, and what exactly do you want to achieve? What do you want to find?
  #7  
12.04.2019, 06:14
djmakinera

What can I say, just text.
  #8  
15.04.2019, 11:36
Jiaz

I'm sorry, but I still haven't understood what you're trying to achieve with the pattern.
  #9  
15.04.2019, 13:35
djmakinera

Find text on a single line that contains at least one Cyrillic character plus other symbols (but without matching literally everything).
  #10  
15.04.2019, 13:58
Jiaz

I'm sorry, but I don't understand. Finding at least one Cyrillic character should be possible via \p{Cyrillic},
but then you lost me. What do you mean by *in one line*?
If you only want to match line by line, then don't enable dotall or multiline.
  #11  
15.04.2019, 14:17
djmakinera

That only matches the Cyrillic characters; it does not match the entire title.
Besides, I want to match only the title, not the author's name or the text.
E.g.

Line1: Title
Line2: The name of the author
Line3: Other name or any text.
Line4: Blank line (not always) or separator "=" (always)
Line5: Another title
Line6: Another name of the author

The following lines repeat in the same way.
  #12  
15.04.2019, 14:22
Jiaz

So basically a complete line with at least one Cyrillic character?
  #13  
15.04.2019, 14:42
djmakinera

Quote:
So basically a complete line with at least one Cyrillic character?
Yeah.

Example:
Line1: **External links are only visible to Support Staff****External links are only visible to Support Staff**
Line2: Дневник настроения. 25 апр. Первая клубника
Line3: Гасник Ирина Александровна
Line4: Dziennik mojego nastroju. 25 kwietnia. Pierwsza truskawka
Line5: Cześć, mój wierny czytelniku!

Line 1 + 2: Find & Extract
Line 3,4,5: Ignore
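A small sketch of "a complete line with at least one Cyrillic character", run against the example lines above. It is in Python, so again the range U+0400 to U+04FF is an assumed stand-in for \p{Cyrillic}, and the URL on the first line is a hypothetical placeholder because the real link is hidden. The point it illustrates: the pattern returns Line 2, but it also returns Line 3 (the author's name), so the Cyrillic test alone does not separate titles from authors.

```python
import re

# Assumption: U+0400-U+04FF stands in for \p{Cyrillic} (not supported by Python's re).
# No DOTALL flag is set, so "." stops at line breaks and each match is one whole line.
CYRILLIC_LINE = re.compile(".*[\u0400-\u04FF].*")

sample = "\n".join([
    "https://example.com/post/1",  # hypothetical URL; the real link is hidden in the post
    "Дневник настроения. 25 апр. Первая клубника",
    "Гасник Ирина Александровна",
    "Dziennik mojego nastroju. 25 kwietnia. Pierwsza truskawka",
])

# Every line that contains at least one Cyrillic character, in original order.
matches = CYRILLIC_LINE.findall(sample)
```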
  #14  
15.04.2019, 18:24
Jiaz

I'm sorry, but what's the difference between 2 and 3? If you always have the same format (line 1 up to line 5), why not simply keep lines 1 and 2 and ignore the rest? Why is the pattern with Cyrillic required?
  #15  
15.04.2019, 19:24
djmakinera

Because I have over 2000 lines, not 5, and I want to extract the URLs and titles. If I simply sort the text into URLs and other text, I lose the pairing with the titles, which are always stored on the line after each URL.

The order from top to bottom:
Line: URL
Next line: Title
  #16  
16.04.2019, 10:07
Jiaz

Wtf. I don't understand again. First you say *keep Line 1+2* and *ignore 3-5*, now you say you have 2000 lines?
How do you want to extract titles without knowing which lines are titles?
  #17  
16.04.2019, 13:18
djmakinera

It seems to be a complicated task, but it should be possible.
Let's say I have a separator; does that change anything in this situation?
Separator (example:===)
URL
Title
Name of the Author
Text
Next Separator (example:===)
Next URL
Next Title
Next Name of the Author
Next Text

The rest repeats in the same way.
  #18  
16.04.2019, 15:40
Jiaz

In that case it's simple. Create a pattern that matches a line full of = (the separator) and then use the next two lines, that's it.
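One possible reading of this suggestion, sketched in Python: walk through the lines, and whenever a line consists only of = characters, take the two lines after it as URL and title. The names SEPARATOR and url_and_title are made up for illustration, and the sketch assumes the separator always comes directly before the URL, as in the layout from post #17.

```python
import re

# A separator is a line made up only of "=" (trailing whitespace tolerated).
SEPARATOR = re.compile(r"=+\s*")

def url_and_title(lines):
    """Yield (url, title) from the two lines that follow each separator line."""
    lines = list(lines)
    for i, line in enumerate(lines):
        if SEPARATOR.fullmatch(line) and i + 2 < len(lines):
            yield lines[i + 1], lines[i + 2]
```

Because only the two lines after each separator are read, it does not matter how many lines the author name and text occupy.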
  #19  
16.04.2019, 18:27
djmakinera

But the text can span multiple lines, so it's still not easy. Besides, with this regular expression it is not so easy, because instead of matching something specific, it matches almost everything!
  #20  
17.04.2019, 10:16
Jiaz

I'm sorry. Please don't change it all the time.
First you explain the lines are
URL
Title
Name
Text
===================(Separator)
URL
Title
Name
Text
===================(Separator)
....

and now you say *but the text can contain multilines*?

Multiple lines don't matter as long as there is a separator you can match on.
Without a unique separator, how do you expect to separate them?
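To illustrate why multi-line text is harmless once there is a separator, here is a hedged Python sketch that cuts the input at every separator line and reads the first two non-empty lines of each block as URL and title; anything after that is kept as "rest", however many lines it has. The function name parse_records and the dictionary keys are assumptions made for the example.

```python
import re

def parse_records(text):
    """Split on separator lines; the first two lines of each block are URL and title."""
    # MULTILINE makes ^ and $ match at every line, so the pattern hits whole "===" lines.
    blocks = re.split(r"^=+[ \t]*$", text, flags=re.MULTILINE)
    records = []
    for block in blocks:
        lines = [ln for ln in block.splitlines() if ln.strip()]  # drop blank lines
        if len(lines) >= 2:
            records.append({"url": lines[0], "title": lines[1], "rest": lines[2:]})
    return records
```

This works the same whether the separator comes before or after each record, since empty blocks are skipped.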
All times are GMT +2.