|
#1
|
|||
|
|||
pattern issue
I wrote the regex below, but it does not work for me.
How can I find titles that are in Russian but may also contain various other Unicode characters? Each match must contain at least one Russian character. It seems complicated and confusing.
^([\p{Cyrillic}]+[\-\.\,\!\…\?\(\)\„\”—0-9\s]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”—0-9\s]*|[\-\.\,\…\?\(\)\„\”—0-9]+[\p{Cyrillic}]+[\p{Cyrillic}\-\.\,\!\…\?\(\)\„\”—0-9\s]*)$
Characters the titles may include:
\p{Cyrillic} ! ? … (Unicode) — (Unicode) - . .. ... , 0-9 ( ) „ (Unicode) ” (Unicode) \s (space) \ / \x{200b} *
|
#2
|
||||
|
||||
Please understand that this is neither a *How do I learn regex* forum nor a *Regex is the solution for everything* forum. Maybe you should start thinking about learning to code and not just rely on (huge, complex) regexes.
__________________
JD-Dev & Server-Admin |
#3
|
|||
|
|||
I want to match at least one Cyrillic character; otherwise the pattern would also match a long string of digits, dashes, dots and the like.
I do not want to match everything, only the characters I specified. :confused:
^ = match start of string (or line, in multiline mode)
[] = match a character in a set
\p{Cyrillic} = match a Cyrillic character
\- = match a literal -
\. = match a literal .
+ = match previous element one or more times
* = match previous element zero or more times
$ = match end of string (or line, in multiline mode)
|
#4
|
||||
|
||||
As always, it's much easier if you also provide examples!
__________________
JD-Dev & Server-Admin |
#5
|
|||
|
|||
**External links are only visible to Support Staff****External links are only visible to Support Staff**
|
#6
|
||||
|
||||
Thanks, and what exactly do you want to achieve? What do you want to find?
__________________
JD-Dev & Server-Admin |
#7
|
|||
|
|||
What can I say? Just text.
|
#8
|
||||
|
||||
I'm sorry, but I still haven't understood what you're trying to achieve with the pattern.
__________________
JD-Dev & Server-Admin |
#9
|
|||
|
|||
Find text that contains at least one Cyrillic character, possibly together with other symbols (but without matching literally everything), on a single line.
|
#10
|
||||
|
||||
I'm sorry, but I don't understand. Finding at least one Cyrillic character should be possible via \p{Cyrillic},
but then you lost me. What do you mean by *in one line*? If you only want to match line by line, then don't enable dotall or multiline.
__________________
JD-Dev & Server-Admin |
#11
|
|||
|
|||
It only matches the Cyrillic characters, not the entire title.
Besides, I want to match only the title, not the author's name or the text. E.g.:
Line1: Title
Line2: The name of the author
Line3: Other name or any text.
Line4: Blank line (not always) or separator "=" (always)
Line5: Another title
Line6: Another name of the author
The following lines continue in the same way.
|
#12
|
||||
|
||||
So basically a complete line with at least one Cyrillic character?
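A minimal sketch of that idea (a complete line with at least one Cyrillic character), written in Python purely for illustration. Python's built-in `re` module does not support `\p{Cyrillic}`, so the range `\u0400-\u04FF` (the basic Cyrillic block) stands in for it here; that substitution, the helper name `cyrillic_lines` and the sample lines are assumptions, not something taken from the thread.

```python
import re

# Assumption: U+0400-U+04FF (basic Cyrillic block) approximates \p{Cyrillic},
# which Python's built-in re module does not understand.
# The lookahead requires one Cyrillic character somewhere on the line,
# then .+ captures the whole line regardless of the other characters.
CYRILLIC_LINE = re.compile(r"^(?=.*[\u0400-\u04FF]).+$", re.MULTILINE)

def cyrillic_lines(text):
    """Return every complete line that contains at least one Cyrillic character."""
    return CYRILLIC_LINE.findall(text)

# Hypothetical sample lines for illustration only.
sample = (
    "Дневник настроения. 25 апр.\n"
    "Dziennik mojego nastroju. 25 kwietnia.\n"
    "12345 ... - (2019)\n"
)
print(cyrillic_lines(sample))
```

The lookahead keeps the "at least one Cyrillic character" condition separate from the list of allowed punctuation, so no long alternation is needed. If the input uses `\r\n` line endings, the trailing `\r` would stay in each match and would need stripping.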
__________________
JD-Dev & Server-Admin |
#13
|
|||
|
|||
Quote:
Example:
Line1: **External links are only visible to Support Staff****External links are only visible to Support Staff**
Line2: Дневник настроения. 25 апр. Первая клубника
Line3: Гасник Ирина Александровна
Line4: Dziennik mojego nastroju. 25 kwietnia. Pierwsza truskawka
Line5: Cześć, mój wierny czytelniku!
Line 1 + 2: Find & Extract
Line 3, 4, 5: Ignore
|
#14
|
||||
|
||||
I'm sorry, but what's the difference between 2 and 3? If you always have the same format (line 1 up to line 5), why not simply keep lines 1 and 2 and ignore the rest? Why is the pattern with Cyrillic required?
__________________
JD-Dev & Server-Admin |
#15
|
|||
|
|||
Because I have over 2000 lines, not 5, and I want to extract the URLs and titles. If I simply sort the text into URLs and other text, I lose the pairing with the titles, which are always stored on the line after the URL.
The order from top to bottom:
Line: URL
Next line: Title
|
#16
|
||||
|
||||
wtf. I don't understand again. First you say *keep Line 1+2* and *ignore 3-5*, now you say you have 2000 lines?
How do you want to extract titles without knowing which lines are titles?
__________________
JD-Dev & Server-Admin |
#17
|
|||
|
|||
It seems to be a complicated task, but it is possible.
Let's say I have a separator; does that change anything in this situation?
Separator (example: ===)
URL
Title
Name of the Author
Text
Next Separator (example: ===)
Next URL
Next Title
Next Name of the Author
Next Text
The rest repeats in the same way.
|
#18
|
||||
|
||||
In that case it's simple. Create a pattern that matches a line full of = (separator) and then use the next two lines, that's it.
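The separator idea can be sketched roughly as follows, in Python for illustration only. The helper name `url_title_pairs`, the sample data, and the rule that a block must start with an `http://` or `https://` line are assumptions layered on top of the suggestion, not part of it.

```python
import re

# A separator is a line consisting only of '=' characters
# (optional trailing blanks or a stray \r are tolerated).
SEPARATOR = re.compile(r"^=+[ \t\r]*$", re.MULTILINE)

def url_title_pairs(text):
    """Split on separator lines and keep the first two non-empty lines of each block."""
    pairs = []
    for block in SEPARATOR.split(text):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Assumption: a valid block starts with a URL line followed by the title.
        if len(lines) >= 2 and lines[0].startswith(("http://", "https://")):
            pairs.append((lines[0], lines[1]))
    return pairs

# Hypothetical sample data for illustration only.
sample = (
    "===\n"
    "https://example.com/1\n"
    "Первая клубника\n"
    "Author One\n"
    "Some text\n"
    "that spans two lines\n"
    "===\n"
    "https://example.com/2\n"
    "Another title\n"
    "Author Two\n"
)
print(url_title_pairs(sample))
```

Because each block is cut at the separator, it does not matter how many lines the text part occupies; only the first two lines of every block are used.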
__________________
JD-Dev & Server-Admin |
#19
|
|||
|
|||
But the text can span multiple lines, so it's still not easy. And besides, it is not so easy with this regular expression, because instead of matching something specific, it matches almost everything!
|
#20
|
||||
|
||||
I'm sorry. Please don't change it all the time.
First you explain the lines are
URL
Title
Name
Text
=================== (Separator)
URL
Title
Name
Text
=================== (Separator)
....
and now you say *but the text can contain multilines*? Multiple lines don't matter as long as there is a separator you can match on. Without a unique separator, how do you expect to separate them?
__________________
JD-Dev & Server-Admin |
#21
|
|||
|
|||
You can consider
but nobody knows how to do it anyway. |
#22
|
||||
|
||||
Please provide an example file with 100 lines or so and send it to support@jdownloader.org
__________________
JD-Dev & Server-Admin |
#23
|
|||
|
|||
Ticket ID: LMN-474-68187
|
#24
|
||||
|
||||
Tested and works fine in regex101
Quote:
__________________
JD-Dev & Server-Admin |
#25
|
|||
|
|||
Thanks for the help.
Unfortunately, there is a problem with the expression. In this case it selects lines 383 and 384, but it does not mark the next line, 385. The same happens with other lines. See screenshot: https://postimg.cc/m1s2T5Rk
Engines used: Perl Regex, Regex++, Onigmo
|
#26
|
|||
|
|||
Hm, maybe add one more part to the expression so that it includes 3 lines:
(?:===\s*|^)(https?:.*?)(?:[\r\n]+)(.*?)(?:[\r\n]+)(.*?)(?:[\r\n]+)
|
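To see what the three capture groups of that expression pick up, here is a small check in Python. Python's `re` engine is an assumption on my part (the thread only mentions regex101, Perl Regex, Regex++ and Onigmo), and the sample data is made up.

```python
import re

# Pattern copied from the post above. re.MULTILINE lets ^ match at the start
# of every line, so a URL line is found even without a preceding separator.
RECORD = re.compile(
    r"(?:===\s*|^)(https?:.*?)(?:[\r\n]+)(.*?)(?:[\r\n]+)(.*?)(?:[\r\n]+)",
    re.MULTILINE,
)

# Hypothetical sample data, not taken from the original file.
sample = (
    "===\n"
    "https://example.com/1\n"
    "Title one\n"
    "Author One\n"
    "Some text\n"
    "===\n"
    "https://example.com/2\n"
    "Title two\n"
    "Author Two\n"
    "More text\n"
)

# Each match yields (URL, following line, line after that).
records = RECORD.findall(sample)
for url, title, author in records:
    print(url, "|", title, "|", author)
```

Group 1 is the URL, group 2 the line after it and group 3 the third line; dropping the last `(.*?)(?:[\r\n]+)` pair gives the two-line variant.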
#27
|
||||
|
||||
It's up to you whether you want to include 1 or 2 lines, but now you have a working pattern.
__________________
JD-Dev & Server-Admin |
#28
|
|||
|
|||
Thanks, Jiaz, for taking the time to help.
|
#29
|
||||
|
||||
You're welcome. I can help better and faster if you always provide real examples first and tell what you really want to achieve
__________________
JD-Dev & Server-Admin |
#30
|
|||
|
|||
New question...
OK: Find the missing dot at the end of the line in the text.
Missing dot: Find the missing dot at the end of the line in the text
Do not look for a dot at the end of the line in these cases:
I want to ignore any URLs
I want to ignore any Cyrillic text
Something is wrong here:
[^\.]\r\n(!?(http|https):\/\/[\w\-_]+(\.[\w]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)+[А-Яа-я])
Look only at the text in Polish or English (normal text), without Unicode.
|
#31
|
|||
|
|||
This needs some correction:
\r\n(?!http|www|А-Яа-я).+[^\.]
|
#32
|
||||
|
||||
Sorry but I don't understand. Can you please provide new example file and tell exactly what you want to achieve. I don't understand your OK and Missing dot
__________________
JD-Dev & Server-Admin |
#33
|
|||
|
|||
I want to find sentences where the dot at the end of the line is missing.
This regular expression finds lines with a dot at the end, but I want it to find the lines where the dot is missing:
\.{1,}$
Besides, it must ignore the links in the text, because links cannot have dots at the end of the line.
**External links are only visible to Support Staff****External links are only visible to Support Staff**
Обо мне
O mnie
1. Witamy na stronach mojego bloga, który staram się uczynić najbardziej
|
#34
|
||||
|
||||
how about
Quote:
__________________
JD-Dev & Server-Admin |
#35
|
||||
|
||||
So you want all lines without dot at the end, but no links?
__________________
JD-Dev & Server-Admin |
#36
|
|||
|
|||
Yes. No links.
Other special characters, such as quotation marks, can be added by putting the escape \ before the " in the string:
.*[^"\.?!:=»А-Яа-я"]$
|
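One way to read the requirement (all lines without a dot at the end, skipping links and Cyrillic lines) is to filter line by line instead of packing everything into a single expression. The following Python sketch does that. The function name, the `www.` prefix check and the `\u0400-\u04FF` range standing in for "Cyrillic" are assumptions for illustration.

```python
import re

# Assumptions: a link is a line starting with http://, https:// or www.,
# and "Cyrillic" means any character in the basic block U+0400-U+04FF.
URL_START = re.compile(r"^(?:https?://|www\.)", re.IGNORECASE)
CYRILLIC = re.compile(r"[\u0400-\u04FF]")

def lines_missing_final_dot(text):
    """Return non-empty lines that do not end with '.', skipping links and Cyrillic lines."""
    result = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are not sentences
        if URL_START.match(stripped) or CYRILLIC.search(stripped):
            continue  # links and Cyrillic lines are ignored on purpose
        if not stripped.endswith("."):
            result.append(stripped)
    return result

# Hypothetical sample lines for illustration only.
sample = (
    "https://example.com/post\n"
    "Обо мне\n"
    "O mnie\n"
    "1. Witamy na stronach mojego bloga.\n"
)
print(lines_missing_final_dot(sample))
```

Keeping the three conditions separate makes it easy to add further exclusions later without touching the dot check.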
#37
|
||||
|
||||
check this
Quote:
__________________
JD-Dev & Server-Admin |
#38
|
|||
|
|||
It works, but I still have a question: if I want to add different special characters, what is the order of matching?
Quote:
https://i.postimg.cc/9Fhz3PCm/Screen...t-12-57-PM.jpg |
#39
|
|||
|
|||
Recognize These Characters:
!#$%&'()*+,-./:;=?@\^_~
Characters Not Allowed at End:
!'(),.:;?
|
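On the question of order: inside a bracketed character class the order of the characters does not affect what is matched. Only a few characters (`\`, `]`, `^` at the start, `-` between two others) need escaping or careful placement. The Python sketch below builds the class from the "not allowed at end" list with `re.escape`, which sidesteps the escaping question. It assumes the goal is to find lines whose last character is none of those characters; the names and sample lines are made up.

```python
import re

# Characters that must not be the last character of a line (from the post above).
FORBIDDEN_AT_END = "!'(),.:;?"

# re.escape turns every special character into a literal, so the order of the
# characters inside [...] does not matter. \s is excluded as well so that the
# negated class can never match a line break or trailing blank.
END_OK = re.compile(
    r"^.*[^" + re.escape(FORBIDDEN_AT_END) + r"\s]$",
    re.MULTILINE,
)

def lines_with_allowed_ending(text):
    """Return lines whose final character is not one of FORBIDDEN_AT_END."""
    return END_OK.findall(text)

# Hypothetical sample lines for illustration only.
sample = (
    "First line without dot\n"
    "Second line ends properly.\n"
    "A question?\n"
    "Trailing colon:\n"
    "Last one\n"
)
print(lines_with_allowed_ending(sample))
```

Changing `FORBIDDEN_AT_END` is enough to allow or forbid further characters; no manual escaping is needed.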
#40
|
||||
|
||||
Can you again provide an example?
You want complete line without Quote:
__________________
JD-Dev & Server-Admin |